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Estimating covariance matrices is a problem of fundamental im- 
portance in multivariate statistics. In practice it is increasingly fre- 
quent to work with data matrices X of dimension nxp, where p and n 
are both large. Results from random matrix theory show very clearly 
that in this setting, standard estimators like the sample covariance 
matrix perform in general very poorly. 

In this "large n, large p" setting, it is sometimes the case that 
practitioners are willing to assume that many elements of the pop- 
ulation covariance matrix are equal to 0, and hence this matrix is 
sparse. We develop an estimator to handle this situation. The esti- 
mator is shown to be consistent in operator norm, when, for instance, 
we have p x n as n — » oo. In other words the largest singular value of 
the difference between the estimator and the population covariance 
matrix goes to zero. This implies consistency of all the eigenvalues 
and consistency of eigenspaces associated to isolated eigenvalues. 

We also propose a notion of sparsity for matrices, that is, "com- 
patible" with spectral analysis and is independent of the ordering of 
the variables. 

1. Introduction. Estimating covariance matrices is the cornerstone of 
much of multivariate statistics. Theoretical contributions (see [2], Chapter 
7, [14, 18]) have been highlighting for a long time the fact that for various 
loss functions, one could improve on the sample covariance matrix as an 
estimator of the population covariance matrix, in a nonasymptotic setting. 

The "large n, large p" context, that is, multivariate analysis of datasets 
for which both the number of observations, n, and the number of variables, 
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p, are large, is, in a somewhat different setting, highlighting the deficiency 
of this estimator. To be more precise, when we refer to "large n, large p" 
problems, we generally mean that pxu, that is, p = O(n) and n = 0(p), and 
p — > oo. If p/n has a nonzero limit as n — > oo, results from random matrix 
theory [21] make clear that in this asymptotic setting, even at just the level 
of eigenvalues, the sample covariance matrix will not lead to a consistent 
estimator. We refer to [12] for a more thorough introduction to these ideas 
and the consequences of the results for statistical practice. 

This is naturally very problematic since this class of results suggests that 
the sample covariance matrix contains little reliable information about the 
population covariance. This realization has helped generate a significant 
amount of work in mathematics, probability and theoretical statistics and 
the behavior of many hard to analyze quantities is now quite well under- 
stood. For instance, under strong distributional assumptions, one can charac- 
terize the fluctuation behavior of the largest eigenvalue of sample covariance 
matrices for quite a large class of population covariance (see, e.g., [11] for 
recent results) , or the fluctuation behavior of linear functionals of eigenval- 
ues (see [1, 3, 19]). However, until very recently there has been less work in 
the direction of using these powerful results for the sake of concrete data 
analysis. 

Of course, since this inconsistency phenomenon is now fairly well-known, 
remedies have been proposed. For instance, the interesting paper [20] pro- 
poses to shrink the sample covariance matrix toward the identity matrix 
using a shrinkage parameter chosen from the data. In [12], a nonparametric 
estimator of the spectrum is proposed and shown to be consistent in the 
sense of weak convergence of distributions. The method in [12] uses convex 
optimization, random matrix theory (the generalization of [21] found in [22]) 
and ideas from nonparametric function estimation. These estimation meth- 
ods rely on asymptotic properties of eigenvalues, and as a starting point for 
estimation of the full covariance matrix, they are essentially trying to get an 
estimator, that is, equivariant under the action of the orthogonal (or uni- 
tary) group. In other words, the "basis" in which the data are given is not 
taken advantage of, and the premise of such an analysis is that we should be 
able to find good estimators in any "basis." While ideally researchers will 
be able to come up with strategies to solve the estimation problem at this 
level of generality, it is reasonable to expect that taking advantage of the 
representation of the data we are given should or might help finding good 
estimators. 

In particular, it is often the case that data analysts are willing to as- 
sume that the basis in which the data are given is somewhat nice. Often 
this translates into assumption that the population covariance matrix has a 
particular structure in this basis, which should naturally be taken advantage 
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of. In this situation, it becomes natural to perform certain forms of regular- 
ization by working directly on the entries of the sample covariance matrix. 
Various strategies have been proposed (see [4, 17]) that try to take advan- 
tage of the assumed structure. The very interesting paper [7] proposed the 
idea of "banding" covariance matrices when it is known that the population 
covariance has small entries far away from the diagonal. The idea is to put 
to zero all coefficients that are too far away from the diagonal and to keep 
the other ones unchanged. Remarkably, in [7], the authors show consistency 
of their estimator in spectral (a.k.a. operator) norm, a very nice result. In 
other words, they show that the largest singular value of the difference be- 
tween their estimator and the population covariance matrix goes to zero 
as both dimensions of the matrices go to infinity and, for instance, when 
p/n has a finite limit. The requirement of estimating consistently in spectral 
norm is a very interesting idea (which we adopt in this paper), since then 
one can deduce easily many results concerning consistency of eigenvalues 
and eigenspaces. We make this remark more precise in Section 3.5, using 
different arguments than those used by Bickel and Levina in [7]. 

It might be argued that ideas such as banding essentially assume that one 
knows a "good" ordering of the variables. As a matter of fact, if we start 
with a matrix with entries small or zero away from the diagonal and reorder 
the variables, the new covariance matrix we obtain may not have only small 
entries away from the diagonal. In some situations, for instance, time series 
analysis, the order of the variables has a statistical/scientific meaning and 
so using it makes sense. However, in many data-analytic problems, there is 
no canonical ordering of the variables. 

Hence to tackle these situations, a natural requirement is to find an esti- 
mator which is equivariant under permutations of the variables. We call such 
estimators permutation-equivariant. Such an estimator would take advan- 
tage of the particular nature of the basis in which the data are given, while 
not requiring the user to find a permutation of the order of the variables 
that makes the analysis particularly simple. Searching for such a permuta- 
tion would — in general — be practically infeasible. Note that regularizing the 
estimator by applying the same function to each of the entries of the matrix 
leads to permutation-equivariant estimators. 

A subject of particular practical interest is the estimation of sparse co- 
variance matrices (see, e.g., [9]) because they are appealing to practitioners 
for several reasons, including interpretability, presumably ease of estimation 
and the practically often encountered situation where while many variables 
are present in the problem, most of them are correlated to only "a few" 
others. 

In this paper we propose to estimate sparse matrices by hard thresholding 
small entries of the sample covariance matrix and putting them to zero. We 
propose a combinatorial view of the problem, inspired in part by classical 
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ideas in random matrix theory, going back to [25] . The notion of sparsity we 
propose is flexible enough that it makes the proofs manageable and at the 
same time rich enough that it captures many practically natural situations. 

We show that our estimators are consistent in spectral norm, in the case of 
both the sample covariance and the sample correlation matrix. No assump- 
tions of normality of the data are required, only the existence of certain 
moments. As is to be expected, the larger the number of moments avail- 
able, the easier the task and the larger the class of matrices we can estimate 
consistently. 

Finally, we note that at the same time as we were researching these ques- 
tions and independently, similar questions were approached from a very 
different point of view by [8]. 

2. Sparse matrices: concepts and definitions. One conceptual difficulty 
of this problem is to define a notion of sparsity for matrices that is compatible 
with spectral analysis. Just as in the case of norms, extending straightfor- 
wardly the notions from vectors to matrices can be somewhat unhelpful. 
In the norm case, the Frobenius norm — the extension of the £2 (vector) 
norm to matrices — is, for instance, known to not be as good as other matrix 
norms from a spectral point of view. Similarly here, we will explain that just 
counting the number of O's in the matrix — the canonical sparsity notion for 
vectors — does not yield a "good" notion of sparsity when one investigates 
the spectral properties of matrices. 

Let us illustrate our problem on a concrete example. Consider now two 
p x p symmetric matrices with the same number of nonzero coefficients: 
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Using the Schur decomposition of E\ to compute its characteristic poly- 
nomial (see also Section 3.3), we see easily that its eigenvalues are (p — 2) 
l's and 1 + Vp~~ 



1/ VP an d 1 — VP ~ 1/ 'VP' ^ n the other hand, E<i is a 
well-known matrix, for instance, in numerical analysis, and its eigenvalues 
are {1 + 2cos(A:7r/(p-|- l))/y / p}fc =1 - Hence, the extreme eigenvalues of these 
matrices are very different, but they have the same number of nonzero coef- 
ficients and their nonzero coefficients have the same values. This raises the 
question of trying to come up with an alternative notion of sparsity that, 
while encompassing the canonical notion of "having a large number of ze- 
ros," might be better suited for the study and the understanding of spectral 
properties of matrices. 

2.1. Matrix sparsity: proposed definition. To describe our proposal, we 
need to introduce several concepts from graph theory and combinatorics. 
For the sake of readability we detail them here; they can also be found in, 
for instance, [23], Section 4.7. To each population covariance matrix, S p , it 
is natural to associate an adjacency matrix A p (Yi p ), in the following fashion: 

A P (i,j) = l CT (ij)^o- 

This matrix A p can in turn be viewed as the adjacency matrix of a graph 
Qp, with p vertices, corresponding to the variables in our statistical problem. 
We call a walk on this graph closed if it starts and finishes at the same vertex. 
The length of a walk is the number of edges it traverses. By definition, we 
call 

C p (k) = {closed walks of length k on the graph with adjacency matrix A p } 
and 



Note that we have 



<p p (k)=C&id{C p (k)}. 



b p (k) =trace(Ap). 
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The following two figures show the graphs corresponding to the adjacency 
matrices of E\ and E%: 




Definition 1. We say that a sequence of covariance matrices {Sp}2?_^ 
is /3-sparse if the graphs associated to them via ^4 p 's have the property that 

Vk £ 2N <p p (k) < f(k)^ k ~ 1)+1 , 

where f(k) £ M + is independent of p and < j3 < 1. 

We say that a sequence of matrices is asymptotically /3-sparse if it is (3 + e 
sparse for any e > 0. 

We call /3 an index of sparsity of the sequence of matrices. 

For short, we say that a matrix is /3-sparse instead of saying that a se- 
quence of matrices is /3-sparse when this shortcut does not cause any con- 
fusion. 

Here are a few simple examples of matrices that are sparse according to 
our definition: 

1. Diagonal matrices. In the case of diagonal matrices, A p = Id p , and Q p 
consists only of self-loops at each vertex. Hence 4>(k) = p, for all k. So a 
diagonal matrix is 0-sparse. 

2. Matrices with at most M nonzero elements on each line. For these ma- 
trices, the corresponding Q p has at most M edges at each vertex. A sim- 
ple enumeration shows that (f)(k) < pM k ~ 1 . So these matrices are also 
0-sparse. 

3. Matrices with at most Mp a nonzero elements on each line. The same 
argument shows that (f){k) < p(Mp) a ^ k ~ 1 \ So these matrices are a-sparse. 
In particular, full matrices are 1-sparse. 
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4. Matrices with at most M (log p) r nonzero elements on each line. We 

have, by simple counting arguments, 4> p (k) < pM k ~ 1 (logp) r ( k ~ 1 \ These 
matrices are therefore /3-sparse for any (5 > and asymptotically 0-sparse. 

Given a matrix S p , we can associate to the corresponding Q p a set of 
weights on the edges, by simply setting the weight of the edge joining vertices 
i and j to S p (i,j). Similarly, for a walk, we have 

Definition 2 (Weight of a walk). Given 7, a closed walk of length k: 
7 : i\ — > i2 — ► 13 — > • • • — ► ik — > ik+i = h, and a matrix S p , we call w 7 the 
weight of the walk 7. By definition it is 

w 7 = S p (ii,i 2 )S p (i 2 ,i3) ■ •■S p (i k ,ii). 

We conclude this section by the following simple but important remark: 

trace(Sp) = w-y. 

7 ec p (fe) 

2.2. Remarks on the notion of sparsity proposed. It is clear that if we 
change the order of the variables in our statistical problem, we do not change 
the "index of sparsity" of our matrices. This is essentially obvious from the 
graph representation of the problem. From a more algebraic standpoint, if 
the permutation that is applied is encoded as a permutation matrix P, the 
covariance in the permuted problem is simply P'Yi p P and the new adjacency 
matrix is P'A p P (this matrix is indeed an adjacency matrix). Since P'P = 
Id p , we have tr&ce((P' A p P) k ) =trace(^4p), and hence the sparsity index is 
unchanged when we permute the variables. 

We also note that we could replace the notion of /3-sparsity we use by the 
requirement that, for some C > 0, 

4> p (k) < C k p l+pk Vk G 2N, 

which is equivalent, if ||| • ||| 2 represents the operator norm or largest singular 
value of a matrix, to 

P P lll2<cV- 

This would result in minor differences in the theorems that follow and might 
be slightly simpler to apply when the only information available concerns 
the largest eigenvalue of Ap. From a combinatorial point of view, the notion 
we use in this paper is a bit more natural and this is what directed our 
choice. 

Finally, we also note that we could replace our notion of sparsity by 
VA: < ko, k e 2N, cj> p {k) < f{k)p^ k ~^ +1 , 
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and call the matrices having this property /3-sparse at order kg. The proofs 
below would still be valid provided ko is large enough, the minimum required 
size for ko being related to the number of moments of the random variables 
making up our data matrix. 

Let us now return to the notion of /3-sparsity proposed in Definition 1. It 
is clear that the smaller (5 is, the sparser the matrix. In particular, if (3q < (3\, 
a matrix which is /3o-sparse is also /3i-sparse. As we will shortly show, the 
class of /3-sparse matrices is stable by addition, which implies that the sum 
of a /?o-sparse and a /3i-sparse matrix is ((3q V /?i)-sparse. 

We conclude this discussion with two properties of /3-sparse matrices. 

Fact 1. The set of [5-sparse matrices is stable by addition. In other 
words, the sum of two (3-sparse matrices is (3-sparse. 

Proof. We call Bq and B\ our "initial" /3-sparse matrices, and B2 
their sum. A2, the adjacency matrix of B2, is not the sum of Aq + A\. In 
particular, edges that are present in both Aq & n d A\ may not be present 
in A2. However, if we add edges to A2, we increase (jr p (k), the number of 
closed walks of length on ^2- So in checking the sparsity index of B2 , we 
can work with A2, which contains all edges in Aq and A\, and contains the 
graph corresponding to A2 as a subgraph of its own graphical representation. 
More algebraically, the definition of A2 is 

A 2 (i,j)=imn(A (i,j)+A 1 (i,j),l) = l Al{iJ)=1 + lA (i,i)=i 1 A 1 (i,j)=o- 

We can write A2 = Aq + A\, with Ao(i,j) = ^A (i,j)=i^A 1 (i,j)=o- Note that Aq 
is a symmetric adjacency matrix, may have zeroes where Aq has ones, but 
does not have ones where Aq has zeroes. So the graph corresponding to Aq 
is a subgraph of the graph corresponding to Aq. In particular, trace(j4Q fc ) < 
trace 

The matrices Aq, A\ and A2 are all symmetric, so when dealing with their 
eigenvalues we can apply standard results for symmetric matrices. Using 
Lidskii's theorem (see [6], Corollary III. 4. 2), we know that 

A i (l 2 )-<A i (io) + A i (^i), 

where \^(A\) is the vector of decreasing eigenvalues of Ai and the sign -< 
means that the left-hand side is majorized by the right-hand side (see [6], 
page 28 for a definition, if needed). Now the functions h(x) = x 2k are convex 
and we therefore have, using standard results in the theory of majorization 
([6], Theorem II.3.1), 

traced) < ^[A 3 (l ) + Xj(Ai)] 2k < 2 2k ^ £ A,(I ) 2fc + A^) 2 * 

< 2 2k ~ l traced + Af). 
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Because Aq and A\ are /3-sparse, we see that A?, k is. And because we have 
seen that 

tvace(A 2 . k ) < trace(,4| fc ), 
we conclude that B2 is /3-sparse. □ 

Fact 2. The set of (3 -sparse matrices is not stable by inversion or mul- 
tiplication. 

Proof. To prove this fact, it suffices to provide an example. Let us 
consider 
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As explained in Section 3.3 below, this matrix is a rank-2 perturbation of 
Id p , and its eigenvalues and eigenvectors are known. Note that S p is 1/2- 
sparse. Also, if a < 1/y/p — 1, S p is a positive semidefinite matrix, and hence 
is a covariance matrix. 

Using the expressions for eigenvalues and eigenvectors found below, we 
see that Y> 2 is full of nonzero entries, and hence is 1-sparse. As a matter 
of fact, it is easily checked that if i > j > 2, Y, 2 (i,j) = a 2 . So there is no 
stability by multiplication, for otherwise T, 2 would be 1/2-sparse. 

Using the classic expression for inverses of low-rank perturbations of ma- 
trices found, for example, in [16], page 19, we see that S" 1 , when it exists, is 
full of nonzero entries and hence is 1-sparse. As a matter of fact, it is easily 
checked that if a 2 7^ l/(p-l), and i > j > 2, S~ j) = -a 2 /(a 2 (p-l)-l). 
So there is no stability by inversion either. □ 

3. Estimation by entrywise thresholding. To avoid any confusion as to 
the meaning of the results to be proved, we remind the reader that the spec- 
tral norm of a matrix A is defined (see [16], page 295) as |||-A|||2 = max{\/A : A 
an eigenvalue of A* A}; in other words, it is the largest singular value of A. 
When A is a symmetric matrix, |||A|||2 coincides with the spectral radius of 
A: p(A) = maxj | (^4) | - In what follows, we use interchangeably the terms 
spectral norm and operator norm. 

When we say that we threshold a variable x at level t we mean that 
we keep (or replace x by) xh x \> t . We also refer to this operation as hard 
thresholding. Our final remark concerns notation: in what follows, C refers 
to a generic constant independent of n and p. Its value may change from 
display to display when there is no risk of confusion about the meaning of 
the statements. If there is, we also use K or C and they play the same role 
as C. 
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3.1. Estimation of sparse covariance or correlation matrices. We first 
prove an intermediate result concerning the Gaussian MLE estimator when 
it is known that the mean of the data is zero (Theorem 1). This is a stepping 
stone to the more practically relevant results concerning the sample covari- 
ance matrix (Theorem 2) and the sample correlation matrix (Theorem 3). 
The proofs of these later results are essentially the same as that of Theo- 
rem 1 , but the proof of Theorem 1 is technically a bit less complicated and 
highlights the key ideas. In Section 3.2, we explain how these results can be 
extended to nonsparse matrices that are approximable by sparse matrices. 
In Section 3.2.1, we discuss a strengthening of Theorems 1, 2 and 3, whose 
possibility was suggested to us by an insightful question of Professor Peter 
Bickel, which allows us to get rid of assumption (ii) in Theorem 1. (This 
strengthening is postponed to this later section for the sake of clarity.) 

We refer the reader to Section 3.5 for detailed explanations of the conse- 
quences for eigenvalues and eigenvectors of Theorems 1, 2 and 3 as well as 
their extensions. Finally, we stress that, unless otherwise noted, all of our 
results are obtained when we have p x n as n — > oo (allowing the ratio p/n 
to have, for instance, a finite nonzero limit), that is, in the "large n, large 
p" setting. 

Theorem 1. Suppose X is an n x p matrix, and assume that pxn as 
n — > oo. Suppose that the rows of X are independent and identically dis- 
tributed and denote them by {^Q}?=i- Call S p the covariance matrix of the 
vector X\ . We also work under the following assumptions: 

(i) Suppose T, p is (3-sparse with [5 = 1/2 — r] and n > 0. 

(ii) Suppose that the nonzero coefficients of T, p are all greater in absolute 
value than Cn~ a °, with < a = 1/2 - 5 < 1/2. 

(iii) Suppose further that for all (i,j), X^j has mean and finite moments 
of order 4/c(r/) ; with k(rj) > (1.5 + e + rf)/{2rf) and k{rf) S N, for some e > 0. 
Assume that k(n) > (2 + e + (3)/(25 ) . 

Call 

1 n 

S p = — > XjX { . 
i=i 

Call T a (S p ) the matrix obtained from thresholding the entries of S p at the 
level Kn~ a with a = 1/2 — 5 > olq, 5 > and K > 0. Then we have, if we 
call — 7fj(*Sp) 

lll^pllb = lllTo(S'p) — S p |||2 — * a.s. as n — > oo, 

where |||M|||2 is the spectral norm of the matrix M . 
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We postpone a short discussion of the meaning of this theorem to after 
the statement of Theorem 2, which is arguably more interesting practically. 

Proof of Theorem 1. We divide the proof into two parts. The first 
part consists in showing the "oracle" version of the theorem, that is, showing 
that operator norm consistency happens when one is given the pairs 
for which a p (i,j) = 0. The second part shows that empirical thresholding 
does not affect this result. 

Let us first remind the reader of a variant of Holder's inequality. Let 
A\, . . . , A m be random variables with finite absolute mth moment. Then we 
have 



e 



Si=l 



<n E (i^n i/m - 

i=l 



Note that for the case m = 2, this is just the Cauchy-Schwarz inequality. 
So the result is true when m = 2. We prove it by induction on m. Suppose 
therefore it is true for all integers less than or equal to m — 1. Call B\ = 
112=2 ^ • ^ Holder's inequality, we have 

|E(Ai5i)| < (E(|^ 1 | m )) 1 /" 1 [E(|Bir/( m - 1 ))] (m ^ 1)/m . 
Now, by the induction hypothesis, applied to the random variables | Ai \ m / ( m_1 ) 5 

(m \ m 

n \Ai\ m ^ m -^ < jjfEfl^H] 1 ^™- 1 ). 
i=2 / i=2 

Therefore, [E(|Bi| m /(m-i))](m-i)/m < rj™ 2 Ed^l" 1 ) 1 /™ an d the inequality 
is verified. 

Now given j(2k), a closed walk of length 2k and the associated matrix 
M, we clearly have 

2k 

(1) |EK (2fc) )|<E(h 7(2fe) |)<n[E(|M(z i ,i i+1 )| 2fc )]^) ; 

j'=i 

assuming for a moment that the relevant moments exist. 

Oracle part of the proof. Let us first introduce some notation. We de- 
note by <jp(i,j) the (i,j)th entry of S p , the population covariance. We call 
oracle(S'p) the matrix with entries S p (i, j)i ap (i,j)^o an d 

Hp = oracle(S'p) — S p . 

Note that we have 

3 p(*>j) = {S p (i,j) - a p (i,j))l ap{id) ^ Q . 
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In the oracle setting, where we assume we know the patterns of zeros in £ p , 
so we focus on the matrix H p . Clearly T, p and H p have the same patterns 
of O's and nonzero, and so if S p is /3-sparse, so is H p . Equation (1) shows 
that if we can control the moments (3 p (i,j)) 2k , we will be able to bound 
the expected weight of each walk. Now we remark that we can write 

1 n 

Sp(i,j) = - V Z m , 
n , 

m=l 

where Z m 's are independent, identically distributed and with mean 0, since 
S p is unbiased for S p . By expanding the power, we get that 

h,—,hk 

This last quantity can be rewritten 

n n 

Zi x ■ ■ ■ Zi 2k = Y\ Z i i with ki = 2k and ki > 0. 

i=l i=l 

We now remark that if there exists iq such that ki = 1, then E(^ x • • • Zi 2k ) = 
0, by independence and the fact that each of the Z^s have mean 0. Therefore, 
in the expansion of (H p (z,j')) 2fc , only the terms for which all nonzero k^s are 
greater than or equal to 2 will contribute to the expectation. Counting the 
number of distinct indices appearing in the product above allows us to get 
a first-order estimate of E(H p (i, j) 2k ). As a matter of fact, the contribution 
of products with q distinct indices is of order n~ 2k n q : by simply counting 
how many such products there are. So we see that to first order, the only 
products that matter are those for which all the Zi's raised to a nonzero 
power are raised to the power 2. Denoting by ni fc ] the fcth factorial moment 
n(n — 1) • • • (n — k + 1), we have, assuming that E(Z 2fc ) < oo, 

if k is fixed and n — > oo. 

We therefore have 

[E(H p (,, J )) 2fc ]V(^) = (-i=). 

In particular, the weight of a closed walk of length 2k on the graph with 
adjacency matrix A p (T, p ) [or A p (E p )] and weights E p (i,j) has the property 
that 

|E(™ 7(2fc) )| <E(|w; 7(2fc) |) = 0(n- fe ). 
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Since we have assumed that T, p and therefore H p are /3-sparse, we have 

E(trace(Hf)) = 0(p 1+/3 ( 2fc - 1 )n- fc ). 

Since our assumption pxn implies that p/n remains bounded, we see that 
E (trace = 0(n 1 / 2+r >~ 2kri ), where 77 = 1/2 - (3. In particular, if k is 
chosen such that 

, ^ 1.5 + £ + 77 

k > , 

2rj 

we see that 

EClSpH*) < E(trace(Ef )) = O(n^), 

because H p is a symmetric matrix, so its spectral norm squared is one of 
its eigenvalues squared. Using Chebyshev's inequality and the first Borel- 
Cantelli lemma, we conclude that 

IllSpllb — ► a.s. 

Note that 2k > 1 + 1/(27? ) would have guaranteed convergence in probability. 
The above proof is correct if Z m has a finite 2fcth moment. Since Z m = 
XmjXmj, the assumption that the entries of the data matrix X have a 
4/cth moment guarantees the existence of a 2/cth moment for Z m , using, for 
instance, the Cauchy-Schwarz inequality. 

We have shown that ||| oracle(<S p ) — S p |||2 — > almost surely, when the 
conditions of the theorem are satisfied. 

Nonoracle part of the proof. We now turn to the nonoracle version of the 
procedure. It is clear that all we need to do at this point is to show that the 
thresholding procedure will lead a.s. to the right adjacency matrix. Recall 
the notation A p = T a (S p ) — £ p , the difference between our estimator and 
the population covariance. Call B p the event B p ={at least one mistake is 
made by thresholding}, that is, A p (T a (S p )) ^ A p (T, p ). Call E p the event 
{|||A P |||2 > s] and F p the event {|||H p |||2 > e] (we do not index these events 
by e to alleviate the notation). Note that 

E p = (E p n B p ) U (E p n B c p ) OBpU (E p n B c p ) = B P U (Fp n B c p ) CBpU Fp. 

We have already seen that P(F p infinitely often) = 0, so if we can show that 
P(B P i.o.) = 0, we will have P(E P i.o.) = 0, as desired. 

Call Op = oracle(S'p) and S~ = S p — O p , where oracle(S'p) is defined above. 
Note that S~ has nonzero entries only where E p has entries equal to 0; when 
that is the case, S~(i,j) = a(i,j). We call V p the set of pairs (i,j) such that 
a(i,j) = 0, that is, 
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We will first show that the maximal element of S p stays below n~ a a.s. 
Note that in general, for a random matrix M and index sets / and J, 

PI max \mij\>e\< P(\mij\>e). 

The same moment computations as the ones we made for H p above show that 
for the elements of S p corresponding to cr p (i,j) = 0, we have E(S , p (i, j) 2k ) = 
0{n~ k ). Therefore, 

p(max\S p (i,j)\>Cn-A< £ B(S p (i,j) 2k )^ = 0(p 2 n 2ka n - k ). 
KVp 1 <ij)G> p C 

Since we assumed that pxn, we see that if k(l — 2a) — 2 > 1 + e, 
p(m&x\S p (i,j)\ >Cn~ a i.o)j =0, 

by the first Borel-Cantelli lemma. In other words, if we call (T a (S p ))~ the 
thresholded version of the part of S p that corresponds to indices in D p , we 
have that P((T a (S p ))- / i.o.) = 0. 

We now turn our attention to T> p , that is, the set of indices for which 
a p{hj) 7^ 0- Recall that we assumed that these cr p (i,j) satisfied \a p (i,j)\ > 
Cn~ a ° and «o < a. Now note for in V p , and c p (i,j) > 0, we have 

{\S p (i,j)\ < Cn~ a } C {0 < a p (i,j) - Cn~ a < a p (i,j) - S p (i,j)}. So, in this 
case, by using the moment computations made above, and using C to denote 
a generic constant, we have 

P(\b p (z,j)\<Cn )< {(Tp{iJ) _ Cn _ a)2k -U{n n ). 
Similarly, when a p (i,j) < 0, we have 

PC| q (a n\\ <r rr,- a ) < E ( a P^' J ') ~ S p( l 'j)) 2k _ Cl(n- k n 2ka °\ 

P{\b p {i,j)\<Cn )< {K{iJ)l _ Cn - a) 2k n )■ 

Now note that because S p is /3-sparse, there are at most 0(p 1+ @) nonzero 
coefficients in S p ; indeed 4> p (2) counts the number of nonzero coefficients in 
T, p . From this we conclude that 

P(3(i ,io) G ^ : |5 P (io,io)| < Cn a ) < 0(n k ^ 2a ^p 1+ ^. 

So if 

, ^ 2 + e + l3 2 + e + [3 

k > = , 

1 — 2ao 25q 
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then, almost surely, no S p (i,j) will be wrongly thresholded, if € V^. 
Combining this result with the one on the indices in T> p , we have 

P(B P i.o.) = 0, 

and we have the result announced in the theorem. □ 

It is, however, more common practice to use as our estimator of the co- 
variance matrix the sample covariance matrix that differs slightly from the 
matrix S p used above, which is the maximum likelihood estimator in the 
(mean 0) Gaussian case. We now show that for the usual estimator the 
same strategy works. 

Theorem 2 (Sample covariance matrix). Suppose the assumptions of 
Theorem 1 are satisfied, but allow now X\ to have a nonzero mean [i. Call 

1 n 

s P = ^J2(x i -x)(x i -xy. 

i=l 

Then the result of Theorem 1 holds; namely, the matrix T a {S p ) — T, p con- 
verges a.s. in spectral norm to 0. 

The previous theorem basically means that if the covariance matrix £„ 
is sparse enough, and if the data come from a distribution with enough 
moments, then thresholding the sample covariance matrix by keeping only 
terms that are a bit larger than \j\fn is a good idea and will lead to an 
estimator that is consistent in operator norm. This is in stark contrast to 
simply using the sample covariance matrix, when in the asymptotics con- 
sidered here, we would not, in general (e.g., as soon as p/n — > I ^ 0), have 
consistency even at the level of the vector of eigenvalues; in the case of 
Ti p = Id, this is a consequence of the results of [13] or [21] and we refer to 
[12] for a thorough discussion. 

PROOF of Theorem 2. The proof proceeds as the one of Theorem 1. 
Since S p is still unbiased for S p , the only thing we have to show here is that 
the 2/cth central moments of S p (i,j) decay in the same fashion as they did 
in Theorem 1. Firstly let us note that 

1 n n 

S p (i,j) =— rX]( J M ~ Hi){ X l,j ~ Mi) — rP^i - IH)(Xj - fij), 

n 1 i=i n 1 

so 

1 n 

S P (i,j) - o- p (i,j) =^—r22((Xi ji - Hi){X hi - fij) - o- p (i,j)) 
n 1 l=i 
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Now, since (a + b) < 2 (a + b ), we see that we will have the result 
we need if we can bound each term in the right-hand side of the previous 
equation. The technique we used above immediately shows that 

assuming for a moment that all the needed moments exist. For the other part 
of the equation, the same argument shows that the only thing we need to 
control is E((Xj — /j,i)(Xj — fij)) 2k , since the assumptions we made about the 
moments of Xi guarantee that a p (i,j) is bounded in p. Using the Cauchy- 
Schwarz inequality, it is clear that all we need to do is control E(Xj — /Uj) 4fc , 
for all i. But Xi — \n is a sum of independent mean-0 random variables and 
the computations we made in the proof of Theorem 1 show that 

E(*<-*) tt = 0(^). 

Therefore, 

E((^-/i i )(X i - A i,-)) a * = o(^). 
So we conclude that 

B(S p (i,j)-a p (i,j)) 2k = 0^, 

just as in the case of the Gaussian MLE estimator. This is all we need to 
complete the proof of Theorem 2, since the last steps follow exactly from the 
proof of Theorem 1. The assumptions made guarantee that all the moments 
used above exist and are finite. □ 

We note that the distribution of the entries of X can change with n and 
p as long as the moment conditions are satisfied and the bounds on the 
moments are uniform in n and p. We now turn to the question of estimating 
correlation matrices. 

Theorem 3 (Correlation matrices). Under the assumptions of Theorem 
1, but requiring the boundedness of the 8k(rj)th moments of the Xij 's in 
assumption (iii), if E p is now the correlation matrix of the vector Xi, and 
if S p is now the sample correlation matrix, we have as before 

\\\T a (S p ) - S p ||| 2 -> a.s. 

Remark 1. We note that the moment assumption can be relaxed to 
4:k(n), if, for instance, one assumes that |||£p|||2 is bounded. This is a simple 
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consequence of the fact that, if we call D p the diagonal of the sample covari- 

ance matrix S p , the sample correlation matrix S p is equal to D p l ^ 2 S p D p 1 ^ 2 . 
Because S p has only 1-s on the diagonal, our results on operator-norm- 
consistent estimation of S p imply in particular that \\\D p — Id p |||2 tends to 
a.s. Elementary algebra then shows that if |||Xp|||2 is, for instance, bounded, 

j/2 ~ 1/2 ~ 

IH-Dp S p D p — S p ||| 2 also tends to a.s., because \\\S p — £p||| 2 does, ac- 
cording to Theorem 2. 



Proof of Theorem 3. Because of invariance of the problem by cen- 
tering and scaling, we can assume that the row vector Xi has mean 0, and 
that the diagonal of its covariance matrix E p is full of 1. Then we have 
p(i,j) = a p (i,j). From the proof of Theorem 1, it is clear that if we can show 
that E(S p (i,j) — p(i,j)) 2k = 0(n~ k ) for all the same technique as above 
will lead to the theorem. To show that this is indeed the case, we first make 
the following elementary remark, which prepares the study of p(i,j) — p(i,j). 
Suppose that F n and G n are random variables, with E(F n — p) 2k = 0(n~ k ) 
for some p G [—1,1], E(G? n - l) 2k = 0(n' k ) and further \F n /G n \ < 1. Call 
O n (e) the event {to : \ G n — 1| < e}. We have 



E 



2k 



E 



< E 



< E 



TT-P 
Fn 

G n 

F n - 



2k 



[1 



-\ 2k 



- P 
P 



L n„(e) 



+ E 



G n 



1 2k 



P 



1 



ns(e) 



< 2 2fc E 



G n 

F n 



2A- 



1 



f2n(e) 



P 



G n 



2 k 



l n n (s)+P 



2k 



1 



+ 2 2fc E(l^ (e) ) 

2k 



1 



U n {e) 



< 



+ 2"E(lnc (e) ) 

2 2k 



(1-8 
,2fe 



,2A- 



-{E((F B - p) 2k ln n( e)) + P 2fe E((G n , - iy k ln n( e))} 



+ 2 2 *E(lnc (e) ). 

By Chebyshev's inequality, and our assumptions, it is clear that 



E 



Fn 

G n 



P 



2k 



o 



1 



71" 



Now we claim that this remark applies in the case of the correlation 
matrix. We have 

F n 



Kh3) 



G n {i,jY 
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where 



1 



n 



F n {i,j) 



n — 1 



1=1 



and G n (i,j) = y / F n (i,i)F n (j,j). 

From the moment computations made in the proof of Theorem 2, we see 
that we have E(F n (i,j) - p(i,j)) 2k = 0(n~ k ), for all Let us denote by 

Y n (i) = F n (i,i); this result implies that 



E(^(i) - VPM) 2 " < E(y„(i) - P (i, i)? k /p{i, i) k = E(F„(i) - lf k 

= 0(n~ k ). 

If we denote a n = y/Y n (i) and (3 n = y/Y n (j), we have, since a n f3 n — 1 = 
(a n - l)(3 n + (3 n - 1, 

E(a n /? n - l) 2fc < 2 2k [B(p n - l) 2k + B((3 2 n k (a n - l) 2k )} 



We have already seen that E(/? n — l) 2k = 0(n fc ), and since we are assuming 



the existence of a 8/cth moment for the Xij, we also have i/E(a n — l) 4fc = 



0(n~ k ). To conclude that E(a n /3 n - l) 2k = O (n k ), we just need to show 
that E(/3 n ) 4fc is bounded. But ffi = (/?„- 1 + l) 4fc < 2 4fc ((/? n - l) 4fc + 1), from 
which we conclude that E(/? n ) 4fc is bounded, since (j3 n — l) 4fc = 0(n~ 2k ). We 
now see that G n (i,j) = y/Y n (i)Y n (j) satisfies with F n the conditions needed 
to conclude that 



3.2. Approximation of nonsparse matrices by sparse matrices. It is nat- 
ural to ask whether a thresholding approach can also lead to good results 
when dealing with matrices which are not sparse per se, but have many co- 
efficients close to zero. In other words, we would like to know when we can 
approximate nonsparse matrices by sparse matrices obtained by thresholding 
a sample covariance or correlation matrix. We now present two propositions 
that relax the sparsity assumptions and still lead to spectral norm conver- 
gence. The most general one basically says that if the population covariance 
matrix can be approximated by a sparse matrix and does not have too many 
coefficients close to the threshold level 1/y/n, then estimating the (not nec- 
essarily sparse) population covariance by thresholding the sample covariance 
matrix will lead to good results. 

Here is our first result in this direction: 



< 2 2k [B((3 n - l) 2k + ^E(/?f)^E(« n -lH. 





□ 
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Proposition 1. Making the same general assumptions as in the theo- 
rems above [i.e., assumptions (i)-(iii) in Theorem 1 are excluded], we now 
assume that: 

• There exists T ai (Ep) = E p , a version of E p thresholded at Cn~ ai , that 
is, (3-sparse, with /3 = 1/2 — 77 and rj > 0. Further we assume that |||E p — 
Sp|||2-^0. 

• We call I ai ,ao th e se ^ of indices of those cr(i,j) for which Cn~ ai < \o~(i,j)\ < 
Cn~ a ° , with a < ot\ < 1/2 — 5q, for some 5q > 0. 

• The adjacency matrix corresponding to I ail(Xo is 7 -sparse, for some 7 < 
oo — Co, where Co > 0. 

• The random variables Xij have moments of order 4k (8k in the corre- 
lation case), with k satisfying the assumptions put forth in Theorem 1, 
assumption (iii), as well as k > (2 + e — 7)/(l — 27), for some e > 0. 

Now if we choose a E (ao,ai), then the conclusions of all the theorems above 
apply: 

\\\T a (S p ) — Sp||| 2 — > a.s., as n — > 00. 

While this proposition might appear full of hard-to-check assumptions, we 
believe it is useful and not so hard to use when checking whether thresholding 
is a reasonable idea for a particular estimation problem. We give an example 
after stating Fact 3 below. Finally, we note that under the assumptions 
stated, both T ao (E p ) and T ai (E p ) are good approximations of T, p in operator 
norm. 

Proof of Proposition 1. In the proof we assume without loss of 
generality that C = 1, which allows us to avoid cumbersome notation. [As the 
reader will see, replacing n~ aa by Cn~ a ° every time an n~ a ° (and similarly 
for n~ ai ) appears does not change anything in the proof.] 

From the previous proofs, we see that we can divide 

T a (S p ) = M + M 1 + M 2 

into three parts. Mq corresponds to the indices (i,j) for which o~(i,j) is 
larger (in absolute value) than n~ a °, M\ corresponds to indices in I a0lCn , 
and M2 to those indices for which |(j(i,j)| < n~ ai . Similarly, we can write 
with the same partition of indices, 

E p = T ao (Ep) + [T ai (Ep) — T ao (Ep)] + [E p — T ai (E p )] = Eo + Si + E2. 

With the same notation for the subparts of E, we have from the computa- 
tions we made in the proofs of the previous theorems that ||| Mq — So ||| 2 - > 
a.s. (by the oracle part of the proofs), and HIM2IH2 — ► a.s., since the d p (i,j) 
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corresponding to |cr(i,j)| < n~ ai will all be (a.s.) thresholded to if the 
thresholding level is n~ a , a < a\. 

Note that So + Si = S p , so IHX2IH2 - ► 0. To reach the conclusions of the 
proposition, we need to show that we control M\ — Si in operator norm. 

Recall that our assumption is that Si is 7-sparse. We call 

Si = T Q (Si) + R a (Li), 

where T a (£i) is the version of Si thresholded at n~ a . It is of course also 
7-sparse and so is R a (Ei). This implies that 

\\\R a (^i)\\\f < trace((i? Q (Si)) 2fc ) < f (k)p^ 2k ^ pn~ 2ka , 

which goes to if 7 < a — e. So we can find ko, an integer independent of 
n and p, such that the right-hand side goes to as n and p go to infinity. 
This implies that ^^(Si)!!^ — ► 0, as n tends to infinity. 

Using the oracle proof of Theorem 1, we see that if we make no error in 
thresholding for the indices in I a0t ai > then ||| oracle a (Mi) — T a (Si)|||2 tends to 
a.s. Therefore, all we need to do is check that we control the operator norm 
of the matrix of possible errors, that is, the difference Mi — oraclec, (Mi ) . Let 
us call Ti this matrix of potential errors. There are two types of possible 
errors: either a coefficient is thresholded when it should not have been, or it 
is not thresholded when it should have been thresholded. So 

!0, if correctly thresholded a(i,j), 

a(i,j), if < n~ a but did not threshold in Mi, 

—a(i,j), if \<Ti j\ > n~ a but did threshold in Mi. 

In any case, we conclude that |Ti(i,j)| < \d(i,j)\ < \a(i,j) — a(i,j)\ + 
\<j(i, Let us call In the matrix 

and T12 the matrix with entries 

Ti2(i,i) = l Tl (i,j)^oW(i,j)\- 

Note that all the indices where Ti has potentially nonzero entries are in 
I aom , so the corresponding adjacency matrix is 7-sparse. Clearly, the same 
is true for Tn and T12. 

Now, IIIT1IH2 < |||Tn + T12IH2, according to Lemma A. 2 in the Appendix. 
Therefore, 

IIIT1IH2 < HI 11 III 2 + |||Ti2 Ho- 
using the fact that all the entries of T12 are less than n~ a ° in absolute 
value, and the fact that T12 is 7-sparse, we have, for k integer, according to 
Lemma A.l below, 

|||Ti 2 i fc < trace(T^) = O^^" 1 ^^- 2 ^). 
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So since p x n, and we assumed that 7 < ao, we see that we can find k 
(finite) such that the right-hand side goes to as n — > 00. Actually, any 
k > (1 — 7)/(2(a — 7)) is a valid choice. So we conclude that 

HI 12 HI 2 — > as re — > 00. 

On the other hand, since 

E(T n (i, j)) 2k < E(a(i,j) - a(i,j)) 2k = 0(n' k ), 

we conclude as before that the expected weight of a walk ("on" Tu) of 
length 2k is 0(n~ k ). Using the assumption of 7-sparsity of the matrix Si, 

we conclude that E(trace(Yf2)) is 0(p" / ( 2k ~ 1 ^ +1 n~ k ) . Therefore, if we can 
pick k> (2 + e — 7)/(l — 27), and finite, we have a.s. convergence of |||Tn|||2 
to 0. Now note that our moment requirements imply that indeed we can 
pick k (finite), with the property that k> (2 + e — — 27). Hence, 

HI "IT 1 HI 2 — ► a.s. 

This concludes the proof since we have bounded |||T Q (5 P ) — £ p |||2 by a sum 
of operator norms of matrices, all of which are going to a.s. □ 

We also have the following proposition. 

Proposition 2. Let us denote by |X p |Had the Hadamard absolute value 
of Hp, that is, the matrix whose (i,j)th entry is equal to \o~(i,j)\. Suppose 
that 1 1 Tip I Had III 2 is uniformly bounded in p. Suppose p^n. Let us call, for 
a > 0, 

RaO-jp) = Yip — T a {Yip). 

Suppose that ||||i?Q (S p )|Had|||2 - > 0, for some given a®, with uq < 1/2 — 5q, 
and 5q > 0. Then, for ao < a < 1/2 — 5q, we have 

III^Sp) - S p ||| 2 a.s., 

provided the moment conditions in Theorem l(iii) are satisfied, with the 
parameters So, 77 = 5o and hence (3 = 1/2 — <5o- 

Proof. Since we have assumed that ||||Xp|Had|||2 < 00, we therefore have 
||||T Q (£ p )|Had|||2 < 00, by using Lemma A. 2 in the Appendix. Now since the 
smallest nonzero entry of |T a (S p )|Had is greater than n~ a , we also have, by 
the same arguments as those developed in Lemma A. 2, for any integer k, 

trace((|T Q (£ p )| Ha d) fc ) > <P P {k)n~ ak . 

Hence, 

trace((|T Q (£ p )| H ad) 2fc ) 1/{2fe) > {<P P {2k)) 1 '^ n~ a . 
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We now note that without loss of generality, we can assume in our definition 
of /3-sparsity that f(k) > 1. Hence, under our assumptions, T a (£ p ) has to 
be at most a-sparse, for otherwise the lower bound in the previous equation 
would go to infinity, and we just saw that ||| |T a (S p )|Had|||2 is bounded. [Recall 

that ||||r Q (E p )| H ad|||2 = lim*_»oo trace((|T a (£ p )| Ha d) 2fc ) 1/(2fc) •] 

Now, it is clear that if a\ > ao, and therefore n~ ai < n~ a °, \R ai (i,j)\ < 
\R ao (i,j)\. This allows us to conclude that 

III I -Rail Had III 2 < III \Ra I Had lb ■ 

Since we have assumed that ||| |i? ao | Had III 2 tends to 0, it is also the case for 
III I Ra I Had III 2 for a > cto- Since we assumed that oiq < 1/2 — So, we can find a± 
such that ao < ol\ < 1/2 — 5q. Let us pick one such a\. Let us also pick a in 
(ao,ai). The situation is now fairly similar to that of Proposition 1, but we 
cannot immediately apply this result because our control of the sparsity of 

^ai,a i s n °t ver y good. 

However, we can apply similar arguments that we detail here. We use 
the same decompositions and notation as in the proof of this theorem. Note 
that Si =T ai (S p ) — T Qo (Sp) is a submatrix of E. Hence, |i? a (Si)(i, j)\ < 
\R a (T,)(i,j)\. Since ||||-R a (S)|Had|||2 goes to 0, we have 

||||-Ra(£l)|Had|||2 ~^ 0- 

Note also that ||| 5^2 1|| 2 = lll-Rai (^)|||2 < llll-Rai (^OlHadlh — >• 0. So all we need 
to do to complete the proof is to control the matrix Ti of potential errors. 
Recall that 

|Ti(i,j)| < < ~ + 

Recall also that all the indices where Ti has potentially nonzero entries 
correspond to the entries of E p whose absolute values are between n~ ai and 
n~ a ° , so Ti is at most ai-sparse. Let us call, as before, Tn the matrix 

and T12 the matrix with entries 

Ti 2 (»,i) = i T i(i,i)^ok(^i)l- 

Clearly, 

|||Tl|||2 < ||||Ti|Had|||2 < III^H + T12IH2 < IIIT11IH2 + HI TC" 12 III 2 - 

Because Tn is ai-sparse and a.\ < 1/2 — 60, and because of the oracle part 
of the proof of Theorem 1, we see that, because of our moment assumptions, 

|||Yii||| 2 - >• a.s. 
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On the other hand, T12 is a submatrix of |Si|Hadj so 

IIIT12IH2 < llll^ai (Sp) — (£ p ) I Had III 2 

< IIH-Rai(Sp) — -Ra (£p) I Had III 2 

< IIH-Rqi (Sp)|Had|||2 + III I -Rao (^p) I Had III 2- 

We conclude that ||| TTi2 III 2 —>■ 0, and therefore that |||Ti|||2 — > a.s. Arguing 
as in the proof of Proposition 1, we finally have the result announced in 
Proposition 2. □ 

The following simple fact is a clear case of applicability of the ideas of 
Proposition 1. 

Fact 3. Let e > and suppose that Ti +£ (S p ) is (3 -sparse and its nonzero 
entries are larger in absolute value than n~ a ° . Then under the same assump- 
tions as Theorems 1, 2 and 3, we have, for ao < a < 1/2 — S, 

\\\T a (S p ) -Sp||| 2 — >0 a.s. 

Proof. Take a\ = ao + 5, where 6 is small. In particular, of course, a\ < 
1/2 < 1 + e. Here I ai ao is empty so the corresponding matrix is 0-sparse. 
In particular, that means that in the notation of the proof of Proposition 
1, Mi = and similarly for Si. So the results on Mq — So and M2 apply 
directly and the only thing we have to check is that HIS2IH2 —> 0. Note that 
S2 contains only entries of order re~( 1+e ) or smaller. Using a Frobenius norm 
bound, we therefore have 

iiis 2 i<p 2 ^ (2+2£) -o, 

so the result is established. □ 

Example: a simple (permuted) Toeplitz matrix. We consider a matrix 
that is often used as an example for estimation: the (Toeplitz) covariance ma- 
trix Sp, with S(i, j) = \p\ < 1. Of course, we can also consider the same 
matrix where the variables have been randomly permuted and hence the 
Toeplitz structure destroyed. However, on any given line, the entries are still 
a (possibly random) permutation of the p^~^. We apply Proposition 1. To 
do so, we just need to count how many coefficients on each row are between 
n~ ai and n~ Q() , for ao an d «i to be chosen later. Note that \p\ k < n~ ai is 
equivalent to k> log(n)ai/log(l/|/?|). So T Ql (S p ) is asymptotically 0-sparse, 
as it contains only 0(log(n)) nonzero terms on each row. Similarly, the ad- 
jacency matrix corresponding to J(ao,ai) is also asymptotically 0-sparse as 
there are at most 0(log(n)) terms on each of its row. Finally, we need to 
check that the thresholded S p is a good approximation of S p . Recall that for 
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a real symmetric matrix M, |||M|||2 < max^j |mjj|). (See, e.g., [7] or [24], 
page 70.) Now, £ fe > feo P k = P k °/(l ~ p), so |||E p -T Q1 (£ p )||| 2 < n~^/(l - \p\), 
which tends to as n goes to infinity. So we conclude that Proposition 1 
applies and thresholding the sample covariance (resp., correlation) matrix 
corresponding to this population covariance will yield an operator norm con- 
sistent estimator, a.s., provided the moment conditions are satisfied. In this 
situation, the moment conditions translate simply into k > 2 + e for some e, 
because ao can be chosen arbitrarily close to 1/2 and 7 arbitrarily close to 
0. 

Finally, we have the following corollaries that apply to all the theorems 
and proposition above. 

Corollary 1 (Infinitely many moments). Suppose that the entries of 
X have infinitely many moments. Then all the above results hold with only 
the sparsity conditions having to be checked. 

Corollary 2 (Asymptotic /3-sparsity). Suppose that the sequence £ p 
is asymptotically (3-sparse. Then all the above results apply, with the mod- 
ification that (3 be replaced by e = (3 + e for e > but arbitrarily small. In 
particular, moment conditions need only to be satisfied and checked with f3 £ . 
In the situation where one has infinitely many moments, one therefore only 
needs to check that the sparsity conditions are satisfied by a (3 £ . 

3.2.1. A refinement of Theorem 1. We now discuss a refinement of Theo- 
rem 1 that allows us to get rid of assumption (ii) there. We remind the reader 
that this assumption is about the size of the nonzero elements of E p . The 
possibility of this refinement was suggested to the author by a question of 
Professor Peter Bickel whom we thank for his very insightful question. This 
discussion is included here because it relies on approximation ideas close 
to the ones we developed for approximating nonsparse matrices by sparse 
matrices. However, here we will approximate sparse matrices by sparse ma- 
trices, the approximating matrix now having quite "large" elements. 

Let us first mention the following lemma, which is proved in the Appendix. 

Lemma. Suppose M is apxp real symmetric matrix, which is (3-sparse. 
Call m = maxjj |Afj j|. Then 

Vfc G 2N |||M||| 2 < I trace(M fc )| 1 / fc < m/ (1 - 1/fc)+1/fe (/(A;)) 1/fe . 
We therefore have the following corollary. 

Corollary 3. Suppose T, p is (3-sparse, with (3 < 1/2 — 77 and r\ > 0. Call 
Tfl +e (Ep) a version of E p thresholded at Cn~^ +£ \ where e > and C is a 
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real number (fixed and independent of p and n ). Call R/3+ £ = E p — T^_|_ e (£ p ) . 
Assume that p^n as n — > oo. Then 

\\\Rf3 +£ \\\2 —> asn^oo. 

The conclusion of the previous corollary is that /3-sparse matrices, regard- 
less of the size of their entries, can be approximated in operator norm by 
/3-sparse matrices whose nonzero elements are greater than n~^ +£ \ 

Proof of Corollary 3. We note that -Rg +e (E p ) is /3-sparse, because 
Tip is, and the graph corresponding to the adjacency matrix i? ( g +e (S p ) is a 
subgraph of the one corresponding to the adjacency matrix of £ p . Note also 
that all the entries of Rp +£ (Ti p ) are less in absolute value than Cn~^ +£ \ 
According to the previous lemma, we therefore have 

\\R P+e {T P )h < vm^cn-^p^ 1 - 1 ^ 1 ^. 

So if k > 1/e, because our assumption that px n implies that p/n remains 
bounded, the right-hand side in the previous equation goes to zero as n goes 
to infinity [Recall that f(k) does not depend on p.] □ 

We are now ready to state our improvement of Theorem 1. 

Theorem 4. Making the same general assumptions as in Theorems 1, 
2 and 3 above [i.e., assumptions (i)-(iii) in Theorem 1 are excluded], we 
now: 

(a) Assume that X p is (i-sparse with /3 = 1/2 — n and r] > 0. 

(b) Pick £o > such that, for some 5q > 0, (3 + £o < 1/2 — 5q- 

(c) Assume that the random variables Xij have moments of order 4k 
(8k in the correlation case), with k satisfying the assumptions put forth in 
Theorem 1, assumption (hi). 

Then, if Tp +£0 / 2 (S p ) is the matrix obtained by thresholding the entries of S p 
at level Kn~^ +£ °/ 2 ^ (S p having the definition given in Theorems 1, 2 and 
3), for some K > (fixed and independent of n and p), we have 

\\\ T P+e /2(Sp) - Splb -> a.s. 

Proof. Let us first note that there exists an eq with the characteristics 
we require. A possible choice is £o = rj/2, to which 5o = n/2 could correspond. 

The theorem is therefore a consequence of Proposition 1. As a matter of 
fact, let us pick a\ = f3 + Eq and a^ = (3 + Eq/A. As we have seen in Corollary 
3, T/3 +£0 (T, p ) is /3-sparse and has the property that |||Tg +£0 (£p) — S p |||2 — > 0. 
Clearly, ao < a± < 1/2 — 5q. In the notation of Proposition 1, I ai ,a a is a 
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subset of the set of indices for which cr(i,j) ^ 0, and hence it is /3-sparse. So 
in the notation of Proposition 1, our I ai , ao is 7-sparse with 7 = (3 < uq — Co, 
where Co = £ o/4. Note also that the moment assumptions made in Theorem 
4 correspond to the moment assumptions made in Proposition 1, with 7 = (3. 
So the conclusion of Proposition 1 applies, and in particular, we can take 
a = (3 + e /2, since (3 + e /2 £ (J3 + e /4,(3 + e ). □ 



3.3. About 1/2-sparse matrices. The previous computations are clearly 
limited to the case where j3 < 1/2. A natural question is therefore to ask 
if this limitation is inherent to the problem, or if it is a consequence of 
the bounds we use in the mathematical analysis. We now want to highlight 
the problems that occur in the case (3 = 1/2 and show that our result is 
"sharp": at the level of generality at which we are working, (at least some) 
1/2-sparse matrices are not estimable consistently in operator norm by hard 
thresholding. To show this, we will produce a 1/2-sparse matrix that cannot 
be consistently estimated in operator norm even at the oracle level. In what 
follows, we assume that p/n has a finite nonzero limit, /, as n tends to 
infinity. 

To do so, we consider estimating a matrix A of the following form: 



1 

0'2 



o 



p-i 



V a v 



a 2 
1 








03 









a p \ 





1 / 



To simplify the problem, we assume that the data are multivariate Gaus- 
sian, with mean 0, and that we know that the diagonal is composed only of 
l's. We estimate S p using the sample covariance matrix, putting to 1 the 
main diagonal, and using the oracle information to put to all other terms 
except the first row and columns. We call S p the corresponding estimator. 
Note that 





«2 - «2 



O 



V 



p-1 



a 
a r 



p-i 



a 2 — 0L2 









a 3 - a 3 









a, 



Using the Schur complement formula for determinants (see, e.g. 
22), we see that the characteristic polynomial of this matrix is 



[16], page 



p(X) = X^ 2 A 2 -J>, 



a; 
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and therefore 

p 



|Sp — Sp|||2 



i=2 



Note that the computation holds for the corresponding adjacency matrix, 
giving that (f)p(2k) = trace(Ap fe ) = 2(p — l) k . So this matrix is 1/2-sparse. 

Now, since we assume the data are Gaussian, it is clear that Ai = |||S P — 
2 has infinitely many moments, using, for instance, Probenius norm as a 
bound on X 1 . Also, E(A?) = J2 P i= 2 E (( a i ~ S *) 2 )- Th e 

covariance of elements 

of the sample covariance matrix is well known in the Wishart case; see, for 
instance, [2], Theorem 3.4.4, page 87. In our context, we see that E((aj — 
Si) 2 ) = (1 + a 2 )/(n — 1) = Vij{n — 1). In particular, 

E(XI) = P - 1 + ^=^ >^^1>0. 
n — 1 n — 1 

We now turn to showing that X 2 actually converges in probability to this 
limit. 

A standard result in Gaussian multivariate analysis (see [2], Theorem 
3.3.2) states that we can write Sj — Oj = (X)fc=i ^k)/{ n ~ 1)> where the Z k s 
are i.i.d. and mean 0. Hence we get that 



^{a i -a i f-v l /{n-l)f) = r 

[n 



— jEy z kl z k2 z ks z k A. 



In the above sum, the terms that contain an index repeated only once 
contribute zero to the expectation. After elementary computations, we see 
that to first order this expectation is 0(2v 2 /n 2 ). Using the same ideas (see 
Appendix), we get that, for i / j, 

(2 1 

(((Si - a^ 2 - Ui/(n - l))((oj - aj) 2 - Vj/(n - 1))) = 0(^a 2 a 2 V 



E 

Hence we have that 



var 



<*-o(£3 + g!£v£ 



p 2v 2 2 / * o\ 2 1 



(E n 2 + n 2[Z a i) V n J- 

Therefore, if, for instance, ai = var(Af) = + ^) — > and 
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and therefore 



A? > — in probability. 

n 



Note that if we had tried to estimate S p using oracle information about the 
location of the nonzero coefficient but nothing about the fact the diagonal 
was equal to 1, we would have encountered the same problem. As a matter 
of fact, if we call M p the diagonal matrix with entries a(i,i), we have from 
previous results in the paper (our moment computations and the O-sparsity 
of this matrix) that |||M p — Id p |||2 — > a.s. Note that because S p had l's on 
its diagonal, 

/a(l,l) a 2 S 3 ... a p \ 

02 5(2,2) ... 

T,p + M p -ld p = : : ■-. •.. q 

S p _i a(p-l,p-l) 
\ a p ... <r(p,p)J 

So for the oracle estimator that uses only information about the location 
of the nonzero coefficients, we have 

~ p — 1 

|||S P + (M p - ld p ) - SpHI 2 > |||S P - Sp||| 2 - |||Mp - Idp||| 2 > a.s. 

This example shows that even using oracle information for estimation 
of the Sp pointed out above does not lead to an operator norm consistent 
estimator, in the presence of this simple 1/2-sparse graph. This suggests 
that for these graphs, simple thresholding might not be a good method. It 
also suggests that the conditions of our theorems have more to do with the 
method we propose than with unrefined mathematical details in its analysis. 



3.3.1. Complement on this example. In what follows we investigate in 
more details the case where cti = 1/y/p. One might ask whether, despite 
the fact that |||S p — S p |||2 does not go to zero, S p does not have some good 
characteristics as an estimator of S p anyway. In what follows, we show that 
for both the eigenvalues and eigenvectors, this is not the case. 

The previous computations essentially show that 

- Id *» - g"? + - - Id » 2 + £t + wky 

so at the level of eigenvalues, the answer is negative. Note that the eigen- 
vectors of Sp — ldp and therefore of S p are known. The ones corresponding 
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to the nonzero eigenvalues are, calling A- 



E 



i>2 a i ) 



U+ 



V2X- 



«2 



and 



v^Ah 



-OL2 



We call u+ the eigenvector corresponding to the positive eigenvalue of S p — 
Id p . When ctj = 1/y/p, cov(Sj,Sj) = (l i= j + l/p)/(n — 1) and A + = ^/(p - l)/p. 
Using the expression above for the eigenvectors, we have 

2\ + \ + (u + ,u + ) = A+A+ + -^—^2,ai. 

Now var(£i> 2 S 4 ) = (p - 1)(1 + l/p)/(n - 1) + (p - l)(p - 2)/(p(n - 1)), and 
E(Ei>2 Sj) = (p — l)/y/p, from which we conclude that 



1 



VP 

Since when all a 



i>2 J 



1 

1 - - 

p 



in probability. 



var 



Ea?)<2fvarfc( 

i>2 / \ \i>2 



4 I 

H — var 
p ' 

-W, A+- 



E 



the above computations show that, since p/n 
ity and therefore, using Slutsky's lemma, we get that 

2 



,i>2 / / 

VT+7 in probabil- 



<«+»«+/ 



in probability. 



So when p/n has a finite nonzero limit, the angle between these two vectors 
has a finite nonzero limit (in probability), showing that the eigenvectors are 
not consistently estimated. 

3.4. Discussion. In the following, we call S p our (final) estimator of T, p , 
which is obtained from the standard estimator S p . As above, we denote 
A„ = Sp — Sp, Sp — oracle(iSp) — S p , where oracle(<Sp) is the oracle version 
of S p , and Dp = S p — S 



v 



3.4.1. Finite- dimensional character and sharpening of the bounds. As is 
clear from the proofs, all the bounds we derive are valid at n and p fixed. 
Essentially, we get bounds on the probability of deviation of the largest 
eigenvalue of the matrix A p from 0. These bounds are polynomial in nature 
since we used Chebyshev's inequality and worked with moments. 
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Note that in particular cases, such as when the entries of the data matrix 
are bounded or satisfy certain tail conditions, these bounds can be sharp- 
ened by using (exponential or Gaussian) concentration inequalities for the 
difference djj = cf(i,j) — a(i,j). If the entries of X are bounded in absolute 
value by a constant C, in the setting of Theorem 1, Hoeffding's inequality 
(see [15]) would, for instance, give that 

P(\dij\ >t) = P(\a(i,j) - a(i,j)\ >t)< 2exp(-ni 2 /(2C 4 )). 

This is a simple consequence of the fact that cr{i,j) is a sum of i.i.d. random 
variables and their mean is a(i,j). (Of course, a slight adjustment is needed 
when dealing with sample covariance matrices, but it does not change the 
exponential character of the bounds. We give the argument in the simplest 
case where S p = X*X/n, the Gaussian MLE when we know the mean is 
zero.) Suppose that the nonzero coefficients of E p are bounded below, in 
absolute value by Cin~ l / 2+b . If we call B p the event B p = {at least one 
mistake is made by the thresholding procedure}, and if we decide to refine 
our thresholding to a (log(n)) a / \fn threshold, we see, using a simple union 
bound, that 

P(B P ) < 2p 2 (exp(-(logn) 2 7(2C7 4 )) + exp(-((logn) a - CV) 2 /(2C 4 ))). 

Therefore, by adding assumptions to our problem, we are able to get sharper 
bounds on the probability of making a mistake by thresholding. 

We can also get better bounds on the probability that |||H P |||2 > £ and 
III 1 2 > £■ We assume that E p is /3-sparse and use the corresponding nota- 
tion. Of course, the event |||Hp|||2 > e is contained in the event trace(H 2fe ) > 
e 2k , which is contained in the event max |w; 7 (2/c)| > e 2fc /(/(/c)p 1+ ^( 2fc_1 ') 
which is contained in the event max \dij\ > £ /(/(jt)l/(2fc) p l/2fc+/3(l-l/2fc) ) _ 
Hence, by using Hoeffding's inequality, we get 

P(|||H P ||| 2 > e) < 2p 2 exp(-n e V 2 V /3 ~ 1)A 7(2C 4 /(A : ) 1/fc )). 

Finally, using the fact that {|||A p ||2 > e} C ({|||S P |||2 > e} H B°) U B p , we see 
that 

P(IA p || 2 > e )<P(5 p ) + P(|||3pI 2 >e) ) 

for which we just derived bounds. Similar types of bounds can be obtained in 
the context of Theorem 2, when, for instance, Hoeffding's inequality applies. 

Though these results are sharper than the ones announced in the theorems 
above, they are less general. Because one of our concerns was distributional 
generality, we decided to give the theorems in general form with less sharp 
bounds. 
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3.4.2. Beyond the bounded p/n assumption. A close look at the proofs 
of the theorems and the bounds above reveals that the assumption that 
p/n remains bounded can be relaxed. As a matter of fact, our bounds on 
expected values of traces are generically of the form 0(p 7 n~ A ), and all we 
require is that this quantity goes to zero fast enough. If we focus on the 
oracle version of the theorems, we see that the bounds are of the form 

E(trace(~f )) = 0(n" y+0(2*-i)). 

If p = 0(n 1 '), we see that the exponent in n becomes of the form k(2/3v — 1) + 
— (3). If this quantity is less than — (1 + e) for some e > and k = ko, then 
we will have a.s. convergence of H p to zero in operator norm. This condition 
is satisfied if 

„< *-< 1 + e > 



l + /?(2fc-l)" 

So in particular, if we are working with random variables with infinitely 
many moments, the oracle results will hold almost surely for a /3-sparse 
matrix when 

p = 0(ro 1// ( 2 ^ -r ') for some r\ arbitrarily small. 
As a matter of fact, all we need to do is pick a finite number k\ such that 

fci-(l + e) 



l + /3(2&i-r 



> V(2/3) -V 



and carry out the analysis for E(trace(Hp ki exists (and is finite), be- 
cause i+ff^fc-i) ~~ * 1/ (2/3), as k goes to infinity. If there are only 4fci mo- 
ments, the results will hold, too. 

On the other hand, the nonoracle results will be satisfied in the context 
of Theorem 1 as soon as 

^ fc(l-2ao)-(l + e) 
v < , 



a constraint much less restrictive than the previous one in general. Finally, 
we note that Proposition 1 would apply if, assuming the other constraints 
were satisfied, we also had 



1+ 7 (2A;-1) 
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3.5. Consequences of spectral norm convergence. 

3.5.1. Convergence of eigenvalues. We recall some classical facts from 
matrix analysis. Firstly, if A and B are two symmetric matrices, and if Aj 
is their ith eigenvalue, where the eigenvalues are sorted in decreasing order, 
we have, by Weyl's theorem (Theorem 4.3.1 in [16]) 



Because the matrix S p is symmetric, the thresholded version of it is sym- 
metric, too. Therefore the operator norm convergence we showed implies the 
following: 

Fact 4. When the thresholded estimator T, p is a spectral norm consis- 
tent estimator of the population covariance or correlation matrix T, p , all the 
eigenvalues of S p are consistent estimators of the population eigenvalues. 

3.5.2. Convergence of eigenvectors. Perhaps even more interestingly, con- 
trolling the spectral norm allows us to get very good control on the angles 
between the eigenspaces of the population and sample covariance matrix, 
through the use of the classical sin(0) theorems of Davis and Kahan ([10], 
Section 2, and [24], Section V.3). For the sake of completeness we quote a 
version of this important result (Theorem 2 in [10]) and show how to exploit 
it in our context. 

Theorem 5 [sin(0) theorem]. Suppose S p has the spectral resolution 



with (X1X2) an orthogonal matrix, X\ being a p x k matrix. Suppose Z is a 
p x k matrix with orthogonal columns, and for any Hermitian matrix M of 
order k, call R = T, p Z — ZM . Suppose the eigenvalues of M are contained in 
an interval \ct,(3] and that for some 5 > 0, the eigenvalues 0/L2 o,re contained 
in R \ [a — 6, f5 + S] . Then for any unitarily invariant norm, 



where Q[TZ(Xi),TZ(Z)] stands for the canonical angles between the column 
space of X\ and that of Z , and smQ[TZ(Xi),TZ(Z)] is the diagonal matrix 
containing these angles. 

These angles are closely connected to canonical correlation analysis: their 
cosines are the canonical correlations for the "data matrices" X\ and Z . 
We therefore have the following corollary to Theorems 2 and 3: 



Xi{A)-Xi(B)\ <|||A-S|||2. 




sin6[^(Xi),^(Z)]|| < 



R 



6 
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Corollary 4 (Consistency of eigenspaces) . Suppose S p has a group of 
eigenvalues contained in an interval and separated from the other eigenvalues 
by 5 > 0. Call the set of their indices (after, say, ordering them) J. Then the 
canonical angles between the column space of the corresponding eigenvectors 
and the column space of the eigenvectors ofT, p (our thresholding estimator) 
corresponding to the eigenvalues ofT, p with index set J goes to zero a.s. 

Proof. Call Xj the eigenvalues of T, p with index set J. Let M be the 
diagonal matrix with diagonal entries the {Xj}. Call L2 the set consisting 
of the other eigenvalues of E p . Note that the convergence of eigenvalues 
guarantees that the {Xj}j^j will a.s. stay away from L2, by a distance at 
least 82 > 0. Call Zj the eigenvectors corresponding to Xj and Z the matrix 
with columns Zj (if some eigenvalues have multiplicity higher than 1, then we 
pick a set of such eigenvectors) . We can write S p = T, p — A p with ||| A p \\\ 2 — > 
a.s. Note that %Z = ZM, so T, p Z = ZM - A p Z. Therefore R = -A p Z 
and because ||| • ||| 2 is matrix norm and the columns of Z are orthonormal, 
III-RIII2 < III III 2- Applying Theorem 5 with these inputs gives the result. □ 

3.6. Practical considerations. The theoretical part of this paper essen- 
tially says /3-sparse matrices with (3 < 1/2 are asymptotically estimable, in 
the strong notion of estimability induced by the spectral norm. However, 
it does not give much information about how to choose the thresholding 
parameter. 

In practice, covariance matrices are estimated for a purpose other than 
simply estimating them. So in concrete applications, users would most likely 
be able to find a penalty function that incorporates a measure of performance 
of a certain estimator and mitigates it with how sparse the corresponding 
matrix is. Then cross-validation or resampling techniques might be used to 
assess the performance of different estimators and choose the threshold from 
the data. Note also, that in [7], Section 5, the authors propose a technique 
for choosing a banding parameter from the data, which is shown empirically 
to work quite well. Such technique is transferable in our context, through 
some fairly straightforward steps. 

However, a shortcoming of resampling techniques is their heavy compu- 
tational cost. Thresholding methods are appealing because they are easily 
"parallelizable" and can be used on very large dimensional datasets. There- 
fore having an a priori method that works reasonably well and is not too 
computationally expensive is also worthwhile. Of course there is a clear link 
between thresholding and testing the hypothesis that a certain parameter is 
0. As a practical ansatz, one method that can be tried is the following: get 
a p-value for the hypothesis o~(i,j) = for all i > j. Such a p- value can be 
obtained by bootstrap methods and since we are dealing with means those 
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reduce to a simple z-test. Then perform the Benjamini-Hochberg procedure 
(see [5]) for these p- values, using the FDR parameter 1/y/p. Though the 
theoretical part of [5] does not apply, we found in the practical examples 
we ran (limited to Gaussian simulations and relatively simple population 
covariance matrices) that this worked reasonably well. We include some fig- 
ures illustrating our simulations (see Appendix A. 2). If speed is the most 
important issue, not using the FDR but testing each entry at level a/-Jp 
seems also to yield reasonable results. 

The issue of positive semidefiniteness. We note that it is possible that 
our estimators will not be positive definite: thresholding entrywise the sam- 
ple covariance or correlation matrix does not guarantee positive definiteness 
of the resulting estimator. Our theorems, however, say that if the population 
matrices have a smallest eigenvalue bounded away from zero (uniformly in 
p), then asymptotically our estimators will yield positive definite matrices 
(in that case, the theorems also imply spectral norm consistency of S" 1 
for Sp 1 ). If, in practice, one encounters a nonpositive definite estimator, 
it is clear that the problem at hand should dictate the strategy to remedy 
this flaw. Three general ideas can nevertheless be applied: one might think 
of "projecting" the estimator on the cone of positive semidefinite matrices, 
using semi definite programming and probably a sparseness penalty. The 
feasibility of this idea depends of course on the dimensionality of the prob- 
lem and it is unlikely to work well (at this point in time) in truly high 
dimension. Another idea would be to do a singular value decomposition of 
the estimator, which is possible even in high dimension, since the estimator 
is by construction sparse, and hence falls within the reach of several fast 
algorithms in numerical linear algebra. Then one could keep a smaller rank 
approximation of S p as the final estimator, £/, by putting, for instance, 
all the negative eigenvalues of T, p to zero [or instead of a real g(p), with 
g(p) — > 0]. Note that £/ can also be shown to be a consistent estimator of the 
population covariance, in spectral norm, since — £ p |||2 — ► because the 
negative eigenvalues of S p have to converge to zero (otherwise — X p |||2 
would not tend to 0). The main drawback of such a solution to the posi- 
tive definiteness problem is that we may lose the sparsity of the estimator, 
a feature that is in general desirable. However, its spectral characteristics 
would be quite easy to obtain, even in high dimension. The third idea would 
be to consider for our estimator, in the case where T, p turns out to not 
be positive definite, the matrix = S p — A p (Xp)Id p , where A p (X p ) is the 
smallest eigenvalue of T, p . Since, as we just noted, |A p (£ p )| — ► 0, we see that 
— S p |||2 — * and hence — £ p |||2 tends to 0, too. Hence, £/ is oper- 
ator norm consistent. Note that it is also sparse, because adding a diagonal 
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matrix does not change anything about the sparsity of our estimator. So 
this is a sparse positive semidefinite and operator norm consistent estimator 
of Tip. 

Robustness issues. Finally, we note that the results of this paper suggest 
that acting entrywise on the sample covariance matrix is a way to create 
good estimators of E p . In particular, when other issues such as robustness 
or contamination by heavy-tailed data arise, using (entrywise) more robust 
estimators than the sample covariance is likely to give improved results. 

4. Conclusion. In this paper we have investigated the theoretical prop- 
erties of the idea of thresholding the entries of a sample covariance (or cor- 
relation) matrix to better estimate the population covariance, when it is 
assumed (or known) to be sparse. We have shown that the natural notion 
of sparsity, coming from problems concerning random vectors, is not ap- 
propriate when one is concerned with estimating matrices and in particular 
their spectral properties. By contrast, we propose an alternative notion of 
sparsity, based on properties of the graph corresponding to the adjacency 
matrix of the population covariance. We have shown that our notion of 
sparsity divides sharply classes of matrices that are estimable through hard 
thresholding and those that are not, an appealing property. The notion of 
sparsity we propose is invariant under permutation of the order of the vari- 
ables and hence is well suited for the analysis of problems where there is no 
canonical ordering of the variables. It is also related to the spectral norm of 
the adjacency matrix of the population covariance. 

We show that /3-sparse matrices, with (3 < 1/2, are consistently estimable 
in operator (a.k.a. spectral) norm, a strong notion of convergence that im- 
plies consistency of all eigenvalues and eigenspaces corresponding to eigen- 
values separated from the rest of the spectrum (see Section 3.5). Practically, 
the results of simulations are maybe not as striking as one may have hoped 
for, but lead to great improvement over the sample covariance (or correla- 
tion) matrix. 

We also show that certain nonsparse matrices are estimable by sparse 
matrices through the thresholding method we analyzed. Numerically, this 
method has many advantages in terms of implementation. It is easy to im- 
plement, and leads to sparse matrices, which have the desirable property 
that their eigenvalues and eigenvectors can be numerically computed effi- 
ciently, even in high dimension. Also, since the method acts in an entrywise 
fashion, the corresponding algorithm is easily parallelizable and in general 
produces results quickly. 

Statistically, our results mean that under the assumption of /?-sparsity, 
(3 < 1/2, applying the natural practical idea of thresholding the entries of 
a sparse matrix leads to good theoretical convergence properties. However, 
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we also show that in situations that are not inconceivable in practice, that 
is, (3 > 1/2, this strategy may sometimes fail to give an estimator as good 
as what we required. More sophisticated approaches may be needed in these 
more difficult cases, though, as noted above, the simple thresholding ap- 
proach has even then many practical virtues. 

APPENDIX 

A.l. 1/2-sparse matrices: details of computations. In what follows, we 
use the notation N for the quantity n — 1 (so N = n — 1) in an effort to 
alleviate the notation. The computations that follow are used in Section 3.3 
and the notation are defined there. Recall that = 1 + af and i > 2. We 
give a detailed explanation of our estimate of 

E(((2i - a,) 2 " n/N)^ - aj f - Vj/N)). 

Clearly, the only thing we need to control is E((5j — aj) 2 (Sj — aj) 2 ), since 
v-i/N and Vj/N are the means of (Sj — c^) 2 and (6y — aj) 2 . Note that we 
can write (Sj — aj) = J2k=i Z k {i)/N, where the Z k (iys are i.i.d. and mean 
0. Similarly, we can write (6tj — aj) = J2k=i ^fc(j)/-^- Note that Yfc(j) is 
independent of Z\ if k is different from /. Therefore, 

E((oi - a i ) 2 (a j - a,) 2 ) = ^E(j2Z kl (i)Z k2 (i)Y k3 (j)Y k4 (j)). 

In the previous sum if an index appears only once in the product, the ex- 
pectation is zero. So only terms where each index appears an even number 
of times will matter. 

We first focus on terms where we have two distinct indices; the contribu- 
tion of such terms is 

N ^ N N ~ 1) B(Z 2 Y 2 + Z 1 Z 2 Y 1 Y 2 + Z 1 Z 2 Y 2 Y 1 ). 

We can limit our investigations to the terms with two distinct indices since 
there are only N terms of the form Z 2 Y 2 , so their contribution will be 
asymptotically negligible. Now, E(Z 2 Y 2 2 ) = ViVj, by independence and defi- 
nition. Also, if X is multivariate Gaussian vector with covariance S p , 

E(Z 1 (i)Y 1 (j)) = E((A 1 A, - aJiXiXj - a,)) = E(X 2 XiXj - a^) 

= cr(l, l)a(i, j) + ctiOtj + OjOj — a^aj 

by using the fact that we are working with Gaussian random variables. 
Therefore, if j, 

E({(3,- Q ,) 2 -£}{( Sj - aj ) 2 -£}) 
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1 1 \ E(ZlY^) — Vii/j 



N 2 JS[3 J ' 



N 3 



In the case where o^ay = o(l/y/N), we see that this term is of order 1/n 3 . 



A. 2. Performance of estimator: graphical illustration. The images of 
this subsection illustrate the performance of the estimator, assessing visu- 
ally its variability and comparing it to the sample covariance matrix. All 
simulations were done with Gaussian data; the thresholding was made ac- 
cording to the FDR rule — in connection with z-tests — with FDR parameter 
1/y/p. Our illustrations focus on the properties of eigenvalues because they 
are easier to visualize. 

All matrices investigated are (symmetric) Toeplitz matrices, because of 
the ease with which they can be simulated. We did not randomly permute 
the "variables" because this would have had no effect on the performance 
of the estimator; in particular, the eigenvalues would be exactly the same. 
These matrices can be summarized by their first row, which is what we refer 
to when speaking of "coefficients" below. 

Case of a Toeplitz matrix, with n = p = 500, and coefficients (1,0.3,0.4, 
0, . . . , 0) . This situation should be fairly easy since the nonzero coefficients 
are quite large compared to the variance of <r(i, j)'s for those (i,j) for which 
o~(i,j) = 0. The results are illustrated in Figure 1(a). 

Case of a Toeplitz matrix, with n = p = 500, and coefficients (2,0.2,0.3, 
0, —0.4, 0, . . . , 0) . This situation is a bit harder than the one above a priori, 
as the nonzero coefficients are not as large compared to the variance of 
a(i,j)'s for those for which cr(i,j) = as they are in the previous 

example. The results are illustrated in Figure 2(a). 

Case of a nonsparse Toeplitz matrix, with n = 500, p = 100, and coef- 
ficients {0.3 fc }^Q. This situation illustrates the approximation of a non- 
sparse matrix by a sparse matrix. As seen above, this population covariance 
can be approximated in spectral norm by a 0-sparse matrix. In these types 
of situations, it is possible that thresholding might be a bit "harsh" and 
"smoother" regularization approaches might lead to better empirical results. 
The results are illustrated in Figure 3(a). 
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Thresholded Matrix: eigenvalues statistics pver 1000 repetitions Thresriolded Matrix: one rsahzation 




(a) (b) 

Fig. 1. Case of a Toeplitz (1, 0.3, 0.4, 0, . . . , 0) population covariance matrix E p , 
n=p = 500. The dashed lines correspond to the 0.025 and 0.975 quantiles of the em- 
pirical distribution of the kth eigenvalue, for k = 1 to p. The data were A/"(0, E p ) and the 
experiment was repeated 1000 times. As we can see, the estimator is very stable. It does 
well, especially "far" from the edges of the spectrum. For this particular E p , it can be 
explained by the fact that the nonzero coefficients in the matrix are easily detectable, when 
n = 500. The improvement over the sample covariance matrix is quite dramatic, (a) Vari- 
ability of estimator and population spectrum: scree plot of population and corresponding 
confidence bounds for ordered eigenvalues of our estimator, (b) Comparison between scree 
plot of our estimator (a.k.a. "Realization": the continuous line between the two dashed 
ones) and that of the sample covariance matrix on one realization, picked at random from 
our 1000 repetitions. 

A. 3. Some linear algebraic results. In the course of our proofs, we need 
the following two lemmas, which are also of independent interest. 

We first prove the following lemma, which we needed earlier in the paper. 



Lemma A.l. Suppose M is a p x p real symmetric matrix, which is 
(3 -sparse. Call m = maxjj |Mjj|. Then 

Vk e 2N |||M||| 2 < | trace(M fc )| 1 / fc < mp^ l - 1/k)+l/k {f{k)) l l k . 

Proof. In what follows, k is an even integer. Let 7 be a closed walk of 
length k. Then, Wy, its weight, clearly satisfies, according to Definition 2, 

1 1 1. 
\Wy \ <m . 

So, since 

trace (M k ) = ^ Wy, 
jec p (k) 

we clearly have 

|trace(M fe )| <m k (j) p (k). 
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Thrasholded Matrix: eigenvalues statistics over 1000' repetitions 



Threstioided Matrix: one realization 




4 
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(a) 



(b) 



Fig. 2. Case of a Toeplitz (2, 0.2, 0.3, 0, — 0.4, 0, . . . , 0) population covariance matrix E p , 
n = p — 500. The dashed lines correspond to the 0.025 and 0.975 quantiles of the empirical 
distribution of the kth eigenvalue, for k= 1 to p. The data were A/(0,S p ) and the experi- 
ment was repeated 1000 times. As we can see, the estimator is very stable. It does capture 
the support of the spectrum fairly accurately, but is not as good in capturing the fine details 
of the bulk. For this particular E p , there is (compared to the previous example of Figure 1) 
a certain lack of accuracy when estimating the adjacency matrix A p of E p , when n = 500. 
The improvement over the sample covariance matrix is quite dramatic, (a) Variability of 
estimator and population spectrum: scree plot of population and corresponding confidence 
bounds for ordered eigenvalues of our estimator, (b) Comparison between scree plot of 
our estimator (a.k.a. "Realization" : the continuous line between the two dashed ones) and 
that of the sample covariance matrix on one realization, picked at random from our 1000 
repetitions. 

Since we assume that M is /3-sparse, 



We now turn to another result we needed in the course of our proofs. 

Lemma A. 2. Suppose that A and B are two real symmetric p x p ma- 
trices, with \A(i,j)\ <B(i,j). Then, 



traced) = w 1 (A). 

7 ec p (fc) 

Now, we clearly have, if 7 is the walk i\ — ► %i —►■•■—> if. — ► i^+i = i±, 
\w 7 (A)\ = \A(ii,i 2 ) ■ ■■A(i k ,i k+1 )\ 
< \A(ii,i 2 )\ ■ ■ ■ \A(i k ,i k+1 )\ 
<B(i 1 ,i 2 )---B(i k ,i k+l ) =w y {B). 




□ 



P||| 2 <|||5||| 2 . 



Proof. Recall that, in the notation of Definition 2 
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Thresholded Matrix: eigenvalues statistics over 1000 repetitions Thresholded Matrix: one realization 
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(a) (b) 

Fig. 3. Case of a Toeplitz {0.3 fc }^~J population covariance matrix E p , n = 500, p = 100. 
The dashed lines correspond to the 0.025 and 0.975 quantiles of the empirical distribution 
of the kth eigenvalue, for k = 1 to p. The data were Af(0,T, p ) and the experiment was 
repeated 1000 times. As we can see, the estimator is very stable. The problem is harder for 
the thresholding technique than the one illustrated in Figure 1, and it is possible that less 
"harsh" regularizations might perform slightly better. The improvement over the sample 
covariance matrix is still quite dramatic, (a) Variability of estimator and population spec- 
trum: scree plot of population and corresponding confidence bounds for ordered eigenvalues 
of our estimator, (b) Comparison between scree plot of our estimator (a.k.a. "Realization": 
the continuous line between the two dashed ones) and that of the sample covariance matrix 
on one realization, picked at random. 

So, if |A|Had is the matrix with entry we have 

| traced)! < trace(|A|£r ad ) < trace(5 fc ). 

For real symmetric matrices, we have |||^4|||2 = lim/ ^ oo [trace(^4 2fc )] 1 /( 2fe \ and 
therefore we can conclude that 

|||A|||2<||||A|Had|||2<|||S||| 2 . □ 
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